Abstract
RNAseq analysis has become a standard practice in molecular biology laboratories worldwide. Numerous tools have been developed to perform differential expression analysis, functional enrichment, gene co-expression clustering, and data visualization from RNAseq data. However, many of these tools require separate workflows and, in many cases, advanced programming skills. RNAseqEasy integrates several of these tools into a single, user-friendly workflow, aiming to simplify the process so that only basic knowledge of R programming is required. It is mainly based on the DESeq21, topGO2 and WGCNA3 packages to perform PCA exploratory plots, differential expression analysis, Gene Ontology enrichment analysis and gene co-expression clustering from Salmon4 quantified data.This vignette provides a comprehensive guide to using the RNAseqEasy package and demonstrates typical RNAseq analysis workflows.
RNAseqEasy package version: 0.0.0.9000
The aim of this package is to simplify RNAseq analysis by integrating some pre-existing widely used packages (DESeq21, topGO2, WGCNA3 and rrvgo5, in a tidyverse6 environment) to obtain results and visualization after performing differential expression analysis, PCA, Gene Ontology enrichment and gene co-expression clustering.
The data used as example in this vignette was generated in the work of Liang et al.7 in the liverwort Marchantia polymorpha.
The main function in RNAseqEasy package is
DESeq2_simple(), which performs principal component
analysis, differential expression analysis and Gene Ontology enrichment
analysis from quantified transcript data.
DiffExprAn <- DESeq2_simple(
output_path = output_Dir_DESeq2, # Path to save results
sampleDir = sampleDir, # Directory with your Salmon quantified data
sample_table = sample_table, # Data frame with your sample names and variables
tx2gene = Marchantia7_tx2gene, # Data frame with your Gene - Transcript relationship
Variable = c("Variable1", "Variable2"), # Variables from sample_table affecting your analysis
Design = "Variable1 + Variable2 + Variable1:Variable2", # How to model samples
Name = "Name_comparison", # Name to include in result files
Contrast = list(c("ContrastA", "ContrastB")) or ContrastA - ContrastB,
geneID2GO = geneID2GO, # GO functional annotation
orgdb = "org.At.tair.db" # Reference organism from rrvgo database
)DESeq2_simple() function general scheme
The RNAseqEasy package is designed to work with data
mapped and quantified by Salmon. The first required input is the path
[1] to the directory where quant.sf sample files are saved.
The second key input is a data frame [2] that contains the sample names
and their corresponding descriptive information (i.e. variables or
conditions defining the experiment).
Salmon4 is a widely used
software for quantifying transcript abundance. A well detailed tutorial
can be found in their official
webpage. It is a very fast tool for transcript quantification
directly from fastq.gzfiles. The only requirement is the
creation of an index of the transcriptome of the
organism we are working with (do not use the genome!!!), also explained
in their tutorial.
As a result, we will obtain a folder [1] containing one folder per
analyzed sample with its names. In each folder, the main output file
will be named quant.sf, which contains the name of each
transcript and their abundance in Transcripts Per Million (TPM), among
other information (length, effective length and number of reads). The
path to this output folder will be the first input for our analysis
[1].
In this example, we will download the quant.sf files
from Liang et al.7
experiment.
# URL from Zenodo
data_url <- "https://zenodo.org/records/15800134/files/GSE275561.zip?download=1"
# Temporal directory to download data
output_dir <- file.path(tempdir(), "GSE275561")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
zip_file <- file.path(tempdir(), "GSE275561.zip")
# Download zip file with Salmon quantified data from Zenodo
download.file(url = data_url, destfile = zip_file, mode = "wb")
# Unzip compressed file
unzip(zip_file, exdir = output_dir)
list.files(output_dir)## [1] "__MACOSX" "02_Salmon" "Sample_Data_Wendy.txt"
So, we set the “02_Salmon” folder as sampleDir, which
will direct the analysis to the directory to retrieve quantified data
for each sample.
For each analysis, you need to create a data frame [2] (which we’ll
call sample_table) that links the name of each sequenced
sample with its descriptive variables. You can create it in R or import
it from a file. It has to include a Sample column including
sample names, and an extra column for each variable of factor that were
included in the experiment design. In our example, we include 12
different samples (Wendy1 to Wendy12), and there were 3 different
variables describing them:
Genotype. Samples were divided in two different
genotypes: ‘Tak1’ (wild-type organisms) and ‘gh3a’ (CRISPR-Cas9
mutants).Treatment. Two different treatments were applied to
samples: ‘Mock’ or ‘OPDA’ (exposure to a high concentration of a plant
stress hormone from jasmonic acid family).Replicate. Each category of samples included 3
independent biological replicates.# Data frame with 'Sample', 'Genotype', 'Treatment and 'Replicate' information
sample_table <- data.frame(
Sample = paste0("Wendy", seq(1,12)),
Genotype = rep(c("Tak1", "gh3a"), each = 6),
Treatment = rep(c("Mock", "OPDA"), each = 3),
Replicate = seq(1,3)
)
sample_table## Sample Genotype Treatment Replicate
## 1 Wendy1 Tak1 Mock 1
## 2 Wendy2 Tak1 Mock 2
## 3 Wendy3 Tak1 Mock 3
## 4 Wendy4 Tak1 OPDA 1
## 5 Wendy5 Tak1 OPDA 2
## 6 Wendy6 Tak1 OPDA 3
## 7 Wendy7 gh3a Mock 1
## 8 Wendy8 gh3a Mock 2
## 9 Wendy9 gh3a Mock 3
## 10 Wendy10 gh3a OPDA 1
## 11 Wendy11 gh3a OPDA 2
## 12 Wendy12 gh3a OPDA 3
In differential expression analysis, comparisons are performed against a reference level. By default, R sets levels based on alphabetical order for each variable. While you can specify the desired comparison in each analysis, it is good practice to set the factor level order beforehand. We transform each variable column into factor and establish the level order. In this example, “Tak-1” is the reference genotype, and “Mock” is the reference treatment.
# Convert 'Genotype' into factor, setting "Tak1" as reference
sample_table$Genotype <- factor(sample_table$Genotype,
levels = c("Tak1", "gh3a"))
# Convert 'Treatment' into factor, setting "Mock" as reference
sample_table$Treatment <- factor(sample_table$Treatment, levels = c("Mock", "OPDA"))
str(sample_table)## 'data.frame': 12 obs. of 4 variables:
## $ Sample : chr "Wendy1" "Wendy2" "Wendy3" "Wendy4" ...
## $ Genotype : Factor w/ 2 levels "Tak1","gh3a": 1 1 1 1 1 1 2 2 2 2 ...
## $ Treatment: Factor w/ 2 levels "Mock","OPDA": 1 1 1 2 2 2 1 1 1 2 ...
## $ Replicate: int 1 2 3 1 2 3 1 2 3 1 ...
The remaining data needed is specific to your organism of interest, not the experiment itself. There are two different files that we will need to import for the analysis:
In this example, we are working with samples extracted from the
liverwort Marchantia polymorpha, a model plant widely use to
study how ancient or conserved biological processes are in plant
evolution. To get the reference transcriptome, we download it from marchantia.info, the Genome Database
for Marchantia polymorpha. The get.fasta.name()
function from phylotools8
package allows us to obtain the transcript names from the fasta file. In
Marchantia, gene names and transcript share the same
nomenclature, only adding a dot and a number to the different
transcripts of a gene. To correlate them, we create a data frame
including a column called Name containing transcript names,
and based on that we create a new column called GENEID
removing last 2 characters (using str_sub() function from
string package).
library(phylotools)
library(tidyverse)
URL <- gzcon(url(paste("https://marchantia.info/download/MpTak_v6.1/", "MpTak_v6.1r1.mrna.fasta.gz", sep="")))
txt <- readLines(URL)
Marchantia7_Transcripts <- get.fasta.name(textConnection(txt), clean_name = FALSE)
head(Marchantia7_Transcripts)## [1] "Mp1g00070.1" "Mp1g00080.1" "Mp1g00090.1" "Mp1g00090.2" "Mp1g00090.3"
## [6] "Mp1g00090.4"
Marchantia7_tx2gene <- data.frame(Name = Marchantia7_Transcripts) %>%
tidyr::separate(Name, sep = " ", into = c("TXNAME", "CDS")) %>%
dplyr::mutate(GENEID = stringr::str_sub(TXNAME, 1, -3)) %>%
dplyr::select(-CDS)
head(Marchantia7_tx2gene, 10)## TXNAME GENEID
## 1 Mp1g00070.1 Mp1g00070
## 2 Mp1g00080.1 Mp1g00080
## 3 Mp1g00090.1 Mp1g00090
## 4 Mp1g00090.2 Mp1g00090
## 5 Mp1g00090.3 Mp1g00090
## 6 Mp1g00090.4 Mp1g00090
## 7 Mp1g00090.5 Mp1g00090
## 8 Mp1g00090.6 Mp1g00090
## 9 Mp1g00100.1 Mp1g00100
## 10 Mp1g00110.1 Mp1g00110
For this example, functional annotation of Gene Ontologies (GOs) for Marchantia genes can be downloaded from the package in a csv file. Many organisms reference databases include this kind of files. If you cannot find them, try to find tools and tutorials to annotate your reference genome.
library(RNAseqEasy)
GO_data <- system.file("extdata",
"Mpo6.1_GO_db_GeneGO.csv",
package="RNAseqEasy", mustWork=TRUE)
Mpo_GO_GOSLIM <- read.csv(GO_data, sep = "\t")
head(Mpo_GO_GOSLIM)## Transcript GO
## 1 Mp1g00060 GO:0003676
## 2 Mp1g00060 GO:0032774
## 3 Mp1g00060 GO:0034062
## 4 Mp1g00060 GO:0042644
## 5 Mp1g00060 GO:0046914
## 6 Mp1g00070 GO:0000075
This annotation file is a two column data frame correlating
transcript names and GO terms (one assotiation per row). For
topGO enrichment analysis, we will need to convert this
data frame into a list where each element is a transcript ID (the key)
and its content is a vector of associated GO terms (the values). The
load_topGO_db() function included in this package helps
with this purpose.
## $Mp1g00060
## [1] "GO:0003676" "GO:0032774" "GO:0034062" "GO:0042644" "GO:0046914"
##
## $Mp1g00070
## [1] "GO:0000075" "GO:0000123" "GO:0000725" "GO:0000790" "GO:0000800"
## [6] "GO:0000976" "GO:0001894" "GO:0003682" "GO:0003700" "GO:0004842"
## [11] "GO:0005704" "GO:0005759" "GO:0005813" "GO:0005829" "GO:0005886"
## [16] "GO:0006302" "GO:0007623" "GO:0008023" "GO:0008157" "GO:0008270"
## [21] "GO:0009653" "GO:0010074" "GO:0010332" "GO:0016458" "GO:0016607"
## [26] "GO:0031057" "GO:0031062" "GO:0031436" "GO:0032409" "GO:0032786"
## [31] "GO:0034243" "GO:0035065" "GO:0036464" "GO:0040029" "GO:0042127"
## [36] "GO:0042800" "GO:0042803" "GO:0042981" "GO:0043970" "GO:0044030"
## [41] "GO:0044093" "GO:0044648" "GO:0044666" "GO:0045322" "GO:0045944"
## [46] "GO:0048872" "GO:0051050" "GO:0051053" "GO:0051569" "GO:0065003"
## [51] "GO:0070531" "GO:0070577" "GO:0071339" "GO:0071479" "GO:0080182"
## [56] "GO:0099402" "GO:1902750" "GO:1903507" "GO:1990904" "GO:2000113"
## [61] "GO:2001025" "GO:2001038"
##
## $Mp1g00080
## [1] "GO:0005794" "GO:0009570" "GO:0009941"
##
## $Mp1g00090
## [1] "GO:0005773" "GO:0005829" "GO:0012505" "GO:0031410" "GO:0034272"
## [6] "GO:0098588" "GO:0098805"
##
## $Mp1g00100
## [1] "GO:0000137" "GO:0000138" "GO:0005634" "GO:0005768" "GO:0005797"
## [6] "GO:0005802" "GO:0007155" "GO:0008194" "GO:0009507" "GO:0009832"
## [11] "GO:0010214" "GO:0010395" "GO:0031224" "GO:0045489" "GO:0070592"
## [16] "GO:0080157"
##
## $Mp1g00110
## [1] "GO:0000491" "GO:0001094" "GO:0005634" "GO:0005737" "GO:0042802"
## [6] "GO:0051117" "GO:0070062" "GO:0070761" "GO:0097159" "GO:1901363"
To perform GO semantic clustering, the analysis is based on rrvgo5 and GOSemSim9 packages. The available species for this analysis are included in Bioconductor webpage. The only plant organism included is Arabidopsis thaliana, so we will use its code (org.At.tair.db) for this analysis. We can save this data in an object, so it will save computing time in the following analysis. There are three classes of GO terms: Biological Processes (BP), Molecular Functions (MF) and Cell Components (CC). In this example, we will only be focused on biological processes.
At this point, we already have everything we need to perform our
RNA-seq analysis: a directory with our quantified transcript data
(sampleDir), a data frame with the experiment metadata
(sample_table), a data frame correlating gene names and
transcript names for our organism (Marchantia7_tx2gene), an
annotation list correlating transcript names and their annotated GO
terms (geneID2GO) and the model organism reference data for
the semantic clustering (save).
We want to obtain differential expressed genes (DEGs) in Tak-1 genotype in OPDA treatment compared to Tak-1 genotype in Mock conditions. To organize our results, we will create a folder to save all differential expression analyses and we will call it “DESeq2”, and inside we will create another folder to save specifically this comparison, and we will call it “Tak1_OPDA”.
dir.create(file.path(output_dir, "DESeq2"))
dir.create(file.path(output_dir, "DESeq2", "Tak1_OPDA"))
output_Dir_DESeq2 <- file.path(output_dir, "DESeq2", "Tak1_OPDA")To run the analysis, we should install DESeq2 and topGO packages from Bioconductor and load them in our R session.
Now, we are ready to run our differential expression analysis. For
that, we will use DESeq2_simple function. We set all
required parameters that we previously defined
(output_path, sampleDir,
sample_table, tx2gene, geneID2GO,
orgdb and semdata). But, still, there are some
parameters that we have to define for this analysis:
For more information about contrasts, read the DESeq2 vignette and check this tutorial.
library(DESeq2)
library(topGO)
DESeq2_simple(
output_path = output_Dir_DESeq2,
sampleDir = sampleDir,
sample_table = sample_table,
tx2gene = Marchantia7_tx2gene,
geneID2GO = geneID2GO,
orgdb = "org.At.tair.db",
semdata = save,
Name = "Tak1_OPDA",
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment"
)## [1] These are the DESeq2 contrasts, that you have to use like list(c("ContrastA", "ContrastB")) :
## [1] "Intercept" "Genotype_gh3a_vs_Tak1"
## [3] "Treatment_OPDA_vs_Mock" "Genotypegh3a.TreatmentOPDA"
## [1] These are the costumized contrasts, that you have to use like ContrastA - ContrastB :
## [1] "Tak1" "gh3a" "Mock" "OPDA" "Tak1_Mock" "Tak1_OPDA"
## [7] "gh3a_Mock" "gh3a_OPDA"
## Error in DESeq2_simple(output_path = output_Dir_DESeq2, sampleDir = sampleDir, : Choose one of the previous contrasts
After running DESeq2_simple with the
Contrast parameter omitted, an error message appears,
prompting us to choose one of the contrasts printed to the console. In
this example, we want to compare Tak-1 genotype in OPDA treatment
samples vs. Tak-1 genotype in Mock conditions. There are two different
ways we can set this contrast:
Tak1_OPDA_result <- DESeq2_simple(
output_path = output_Dir_DESeq2,
sampleDir = sampleDir,
sample_table = sample_table,
Include = NULL,
Exclude = NULL,
tx2gene = Marchantia7_tx2gene,
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment",
Group = "NO",
Name = "Tak1_OPDA",
Contrast = Tak1_OPDA - Tak1_Mock,
Reduced = FALSE,
log2FCtopGO = 1,
geneID2GO = geneID2GO,
ontology = "BP",
plot_similarity = TRUE,
orgdb = "org.At.tair.db",
semdata = save
)If we want to perform other possible analyses, these would be the contrast we would have to use:
OPDA vs Mock in gh3a genotype: gh3a_OPDA - gh3a_Mock, or list(c(“Treatment_OPDA_vs_Mock”, “Genotypegh3a.TreatmentOPDA”)).
gh3a vs Tak-1 in Mock conditions: gh3a_Mock - Tak1_Mock, or list(c(“Genotype_gh3a_vs_Tak1”)).
gh3a vs Tak-1 in OPDA conditions: gh3a_OPDA - Tak1_OPDA, or list(c(“Genotype_gh3a_vs_Tak1”, “Genotypegh3a.TreatmentOPDA”)).
Genes affected by the Genotype:Treatment interaction: (gh3a_OPDA - gh3a_Mock) - (Tak1_OPDA - Tak1_Mock), or list(c(“Genotypegh3a.TreatmentOPDA”)). This are genes that will be responding to OPDA treatment in the gh3a genotype in a different way that the expected from the response in Tak-1 genotype.
dir.create(file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction"))
output_Dir_interaction <- file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction")
Genotype_Treatment_result <- DESeq2_simple(
output_path = output_Dir_interaction,
sampleDir = sampleDir,
sample_table = sample_table,
Include = NULL,
Exclude = NULL,
tx2gene = Marchantia7_tx2gene,
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment",
Group = "NO",
Name = "Genotype_Treatment_interaction",
Contrast = list(c("Genotypegh3a.TreatmentOPDA")),
Reduced = FALSE,
log2FCtopGO = 1,
geneID2GO = geneID2GO,
ontology = "BP",
plot_similarity = TRUE,
orgdb = "org.At.tair.db",
semdata = save
)By default, all samples in sampleDir will be included
for the differential expression analysis using
DESeq2_simple function. However, it could be interesting to
perform an analysis including only a subset of these samples. For
example, let’s think about the scenario where we want to study DEGs by
OPDA in Tak-1 genotype ignoring the existence of gh3a in the
experiment. Thus, we would only want to include samples belonging to
Tak-1 genotype. There are two possible ways of doing that in
DESeq2_simple function: by using Include or
Exclude parameters.
Include: we define features in a vector that all
samples must present to be included in the analysis. In this example, it
would be Include = c("Tak1").Exclude: we define features in a vector that we want to
remove from the analysis. All samples containing any of these features
will be removed. In this example, it would be
Exclude = c("gh3a").In our example, both parameters would work the same way, i.e., only including Tak-1 samples in the analysis, both in OPDA and Mock treatments.
Apart from differential expression analysis, all samples included in
the DESeq2_simple function will be compute for principal
component analysis and we will obtain a png file of PC1 vs PC2 using 500
top genes by default. If we do not want to generate the PCA, we can set
the PCA parameter to FALSE within the
DESeq2_simple function. If we want to perform a different
principal component analysis (plotting other component, using a
different number of top genes, changing colors, etc…) we can do it by
running the run_pca_analysis function.
After running DESeq2_simple function, there are some
files generated with results from the differential gene expression
analysis:
## baseMean log2FoldChange lfcSE stat pvalue
## Mp1g00070 147.7289 -0.87073377 0.2215426 -3.9303223 8.483207e-05
## Mp1g00080 8077.3196 -0.54069976 0.1329801 -4.0660199 4.782285e-05
## Mp1g00090 531.2465 -0.05551592 0.1356234 -0.4093389 6.822909e-01
## Mp1g00100 1127.5365 0.87159546 0.1302033 6.6941140 2.169823e-11
## Mp1g00110 419.9734 -0.10280589 0.1763512 -0.5829610 5.599196e-01
## Mp1g00120 511.5859 -0.35720864 0.1180411 -3.0261372 2.476999e-03
## padj
## Mp1g00070 3.932574e-04
## Mp1g00080 2.335654e-04
## Mp1g00090 7.772935e-01
## Mp1g00100 2.389950e-10
## Mp1g00110 6.791640e-01
## Mp1g00120 7.944981e-03
Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype
PCA parameter is
TRUE (which is the default) (PCA_Name.png).PCA plot of PC1 vs PC2, using 500 top genes
## Annotated Significant Expected
## GO:0044238 6226 509 528.04
## GO:0045893 688 59 58.35
## GO:0048574 90 9 7.63
## GO:0010154 1454 150 123.32
## GO:0044706 462 48 39.18
## GO:0044281 1878 198 159.28
## Term pval
## GO:0044238 primary metabolic process 0.0061337475
## GO:0045893 positive regulation of DNA-templated transcription 0.0060918982
## GO:0048574 long-day photoperiodism, flowering 0.0049341244
## GO:0010154 fruit development 0.0002647838
## GO:0044706 multi-multicellular organism process 0.0321583675
## GO:0044281 small molecule metabolic process 0.0091339926
## Enrichment
## GO:0044238 1.006982
## GO:0045893 1.056274
## GO:0048574 1.231723
## GO:0010154 1.270691
## GO:0044706 1.279712
## GO:0044281 1.298621
A tree map plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Treemap.pdf).
A scatter plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Scatterplot.pdf).
A bubble plot of all enriched GO terms in descendant order according to their enrichment (topGO_Name_All/Up/Down_bubble.pdf).
Tree map of Gene Ontologies enriched in Up-regulated genes
Semantic cluster of Gene Ontologies enriched in Up-regulated genes
Bubble plot of Gene Ontologies enriched in Up-regulated genes
WGCNA_Module() function general scheme
Another application of the package is the generation of co-expressed
gene modules by the implementation of components from the WGCNA3 package. This functionality is
included in the WGCNA_Modules function. As
DESeq2_simple function, it requires to specify the
output_path, sampleDir,
sample_table, Variables, tx2gene,
Name, geneID2GO and Annotation
parameters. The specific parameters that have also to be included in
this function are:
DEGs: A vector of Gene IDs. This will be the input set
of genes that we want to cluster. The Gene IDs must correspond to the
transcriptome that we are using in the analysis.Power: Soft thresholding power used in the network
construction (check the WGNA
manual or tutorial
for more info). If you run the WGCNA_Modules function
without specifying the Power parameter, it will stop and
generate a “softthresholdingpower.pdf” file to help you select an
appropriate power value. As an alternative, this table suggests the
power to select depending on the number of samples and type of network:
Most of the advanced parameters in for the
blockwiseModules function of the WGCNA package for the
module generation are set to default.
As an example, we will use the output DEGs from the
Genotype:Treatment interaction in DESeq2_simple, i.e.,
genes that significantly respond to the OPDA treatment in the
gh3a genotype in a different way than the Tak-1 genotype. This
different response can be in different ways, so we will use the gene
co-expression modules to explore these responses.
DESeq2_simple_output <- read.table(file.path(output_Dir_interaction, "Genotype_Treatment_interaction.txt"),
header = TRUE)
## We get rownames (GeneIDs) of the output data frame from DESeq2_simple(),
## applying a filter of log2FC > 1 for Up-regulated DEGs and log2FC < -1 for
## Down-regulation.
DEGs <- rownames(DESeq2_simple_output %>% filter(log2FoldChange > 1 | log2FoldChange < -1))Now that we already have selected our input gene set, we can run the
WGCNA_Modules function:
library(WGCNA)
dir.create(file.path(output_dir, "WGCNA"))
output_WGCNA <- file.path(output_dir, "WGCNA")
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
sampleDir = sampleDir,
sample_table = sample_table,
DEGs = DEGs,
Variables = c("Genotype", "Treatment"),
tx2gene = Marchantia7_tx2gene,
Name = "Genotype_Treatment_Interaction",
NumberCol = 1,
geneID2GO = geneID2GO,
semdata = save)## Flagging genes and samples with too many missing values...
## ..step 1
## pickSoftThreshold: will use block size 886.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 886 of 886
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.263 0.783 0.143 331.00 327.000 483.0
## 2 2 0.403 -0.390 0.352 179.00 158.000 345.0
## 3 3 0.762 -0.636 0.874 115.00 86.100 274.0
## 4 4 0.825 -0.726 0.933 81.60 50.200 230.0
## 5 5 0.857 -0.769 0.957 61.60 31.700 199.0
## 6 6 0.886 -0.799 0.980 48.60 21.000 175.0
## 7 7 0.883 -0.823 0.945 39.60 15.400 158.0
## 8 8 0.908 -0.837 0.958 33.10 11.700 143.0
## 9 9 0.919 -0.840 0.963 28.20 8.820 131.0
## 10 10 0.930 -0.856 0.966 24.40 6.890 121.0
## 11 12 0.935 -0.865 0.970 18.90 4.310 105.0
## 12 14 0.944 -0.874 0.969 15.10 2.870 92.3
## 13 16 0.956 -0.880 0.979 12.50 1.980 82.4
## 14 18 0.955 -0.878 0.973 10.50 1.430 74.4
## 15 20 0.952 -0.898 0.969 8.96 1.090 67.6
## 16 22 0.962 -0.905 0.970 7.76 0.806 62.0
## 17 24 0.968 -0.905 0.970 6.79 0.597 57.1
## 18 26 0.975 -0.900 0.979 6.00 0.444 53.1
## 19 28 0.973 -0.905 0.970 5.35 0.360 49.5
## 20 30 0.976 -0.919 0.973 4.80 0.278 46.3
## 21 32 0.977 -0.930 0.972 4.33 0.218 43.5
## 22 34 0.982 -0.939 0.979 3.93 0.174 41.0
## 23 36 0.977 -0.931 0.973 3.58 0.137 38.7
## 24 38 0.985 -0.934 0.982 3.28 0.109 36.6
## 25 40 0.974 -0.936 0.968 3.02 0.086 34.7
## Error in WGCNA_Modules(output_path = output_WGCNA, sampleDir = sampleDir, : Choose a power based on softthresholdingpower.pdf
We did not include the Power parameter, so the function
stopped and warned us to select one based on the
“softthresholdingpower.pdf” file.
R^2 values as a function of the soft thresholds
Based on this plot, we will select 7 as the soft thresholding power since it is the first value that rises the threshold.
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
sampleDir = sampleDir,
sample_table = sample_table,
DEGs = DEGs,
Variables = c("Genotype", "Treatment"),
tx2gene = Marchantia7_tx2gene,
Name = "Genotype_Treatment_Interaction",
Power = 7,
NumberCol = 1,
geneID2GO = geneID2GO,
semdata = save)## Flagging genes and samples with too many missing values...
## ..step 1
## pickSoftThreshold: will use block size 886.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 886 of 886
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.263 0.783 0.143 331.00 327.000 483.0
## 2 2 0.403 -0.390 0.352 179.00 158.000 345.0
## 3 3 0.762 -0.636 0.874 115.00 86.100 274.0
## 4 4 0.825 -0.726 0.933 81.60 50.200 230.0
## 5 5 0.857 -0.769 0.957 61.60 31.700 199.0
## 6 6 0.886 -0.799 0.980 48.60 21.000 175.0
## 7 7 0.883 -0.823 0.945 39.60 15.400 158.0
## 8 8 0.908 -0.837 0.958 33.10 11.700 143.0
## 9 9 0.919 -0.840 0.963 28.20 8.820 131.0
## 10 10 0.930 -0.856 0.966 24.40 6.890 121.0
## 11 12 0.935 -0.865 0.970 18.90 4.310 105.0
## 12 14 0.944 -0.874 0.969 15.10 2.870 92.3
## 13 16 0.956 -0.880 0.979 12.50 1.980 82.4
## 14 18 0.955 -0.878 0.973 10.50 1.430 74.4
## 15 20 0.952 -0.898 0.969 8.96 1.090 67.6
## 16 22 0.962 -0.905 0.970 7.76 0.806 62.0
## 17 24 0.968 -0.905 0.970 6.79 0.597 57.1
## 18 26 0.975 -0.900 0.979 6.00 0.444 53.1
## 19 28 0.973 -0.905 0.970 5.35 0.360 49.5
## 20 30 0.976 -0.919 0.973 4.80 0.278 46.3
## 21 32 0.977 -0.930 0.972 4.33 0.218 43.5
## 22 34 0.982 -0.939 0.979 3.93 0.174 41.0
## 23 36 0.977 -0.931 0.973 3.58 0.137 38.7
## 24 38 0.985 -0.934 0.982 3.28 0.109 36.6
## 25 40 0.974 -0.936 0.968 3.02 0.086 34.7
## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 1 into file RNAseqEasy-block.1.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 8 genes from module 1 because their KME is too low.
## ..removing 2 genes from module 2 because their KME is too low.
## ..removing 5 genes from module 3 because their KME is too low.
## ..removing 1 genes from module 4 because their KME is too low.
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
There are 6 different clusters or modules that have been generated after the analysis. The average behavior of genes from each category are summarized in the “Summary_Modules_Name.pdf” file:
Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction
A tab-delimited TXT file for each module, containing genes included in that module and its TPM expression in each sample of the analysis (Module_genes.txt).
A CSV file of two columns, including all genes used as input
(DEGs parameter) and the module name where they were
included (Name_geneInfo.csv).
Three PDF files:
Power parameter (softthresholdingpower.pdf)NumberCol parameter, and the colors of the lines specifying
a named vector in the Colors_plot parameter (colors as
values, variables as names).A “topGO” folder, including same output files for each module as
DESeq2_simple() function.
topGO_All() function general scheme
Gene Ontology Enrichment analysis in this package is based on
topGO2 package, and the
function that performs it is topGO_All(). This function is
also integrated in the DESeq2_simple() and
WGCNA_Module() functions. It requires a vector or data
frame with transcript IDs (if it is a data frame, it should follow the
DESeq2_simple() output format, where Gene IDs are row
names). It also requires the Name, Annotation
and orgdb parameters.
ontology parameter.algorithm is “weight01”, but it can be changed
to “classic”, “elim”, “weight”, “lea” or “parentchild” (read topGO
vignette for more information).statistic is “fisher”, but it can be changed to
“ks”, “t”, “globaltest” or “sum” (read topGO
vignette for more information).save_GeneNames parameter
to TRUE and include a character vector with the GO IDs in
the Ontologies parameter. It will generate a XLSX file with
a spreadsheet for each ontology. Be sure that every GO ID is represented
as an output of the topGO_All() function.